Intro: Some notes for filtering, exploring, and conducting some basic statistical analyses on raw climate data from local station observations using R. This script was originally written in the context of preparing climate input data for a WEAP model, and is a good initial step in this process. The script was prepared by the Stockholm Environment Institute (SEI) Water Group (see SEI Water Group’s github page here).
Access to raw data and Rmarkdown files: The Rmarkdown file used to generate this page is stored on github, here. The example data/files used in this script are also stored on github, here.
Last Updated: Manon von Kaenel, Sept 26th 2016
This script goes over some common workflows for exploring, filtering, and correcting climate station data. The goal of these initial analyses of climatological time series is to familiarize yourself with your climate data; specifically, to:
The knowledge and understanding generated by these analyses can inform the following workflows:
This example will use precipitation and temperature records from stations in the Chancay Lambayeque watershed, Peru.
Data used for the examples are hosted on github, here. The zipped file contains the following…
You can download/unzip the data by hand or run the following code to download and unzip the data to a temporary directory.
setwd(tempdir()) # set working directory to temporary directory
dir <- getwd()
download.file("https://github.com/seiwater/Raw-Data/raw/master/rawclimatedataR.zip", destfile = 'rawclimatedataR.zip')
unzip( 'rawclimatedataR.zip' , exdir = dir)
Some functions we will use come from the following packages, so be sure to have them installed prior to attempting this exercise:
You can download these packages with…
install.packages('ggplot2', repos = "http://cran.cnr.berkeley.edu/",lib=dir)
install.packages('rgdal',repos="http://cran.cnr.berkeley.edu/",lib=dir, type="source")
install.packages('raster', repos = "http://cran.cnr.berkeley.edu/",type="source",lib=dir)
library(ggplot2)
library(rgdal)
library(raster)
First, let’s get all the data you’ll need for this script loaded into your working directory.
setwd(dir)
Precip <- read.csv('Table_All_Precip.csv', check.names = F, stringsAsFactors = F)
Temp_Min <- read.csv('Table_All_Temp_Min.csv', check.names = F, stringsAsFactors = F)
Temp_Max <- read.csv('Table_All_Temp_Max.csv', check.names = F, stringsAsFactors = F)
Coordinates <- read.csv('Coordinates.csv', check.names = F, stringsAsFactors = F)
project <- readOGR(dir,'Proyecto_CuencaCL')
That was easy!
As a side note, the ‘check.names = F’ parameter in the ‘read.csv’ function preserves the column headers from the original record table.
Your input climate tables should look like this (in this case, the first couple months on record have missing data for the stations displayed here):
head(Precip[,1:7])
## Anio Mes Dia ASUNCION Augusto Weberbauer AYLAMBO
## 1 1 1960 1 1 NA NA NA
## 2 2 1960 1 2 NA NA NA
## 3 3 1960 1 3 NA NA NA
## 4 4 1960 1 4 NA NA NA
## 5 5 1960 1 5 NA NA NA
## 6 6 1960 1 6 NA NA NA
Note that the first three columns correspond to the date information: year, month, and day. The script works with R Date objects (in the form YYYY-MM-DD), so we will convert these three columns to dates later. Your data may store dates in a different format; regardless, they must end up as Date objects. Also note that the example data is daily; the script should also work for monthly data.
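As a quick illustration of that conversion, here is a toy example (the Anio/Mes/Dia names mirror the example tables; the values are made up):

```r
# Toy example: assembling an R Date column from separate
# year/month/day columns, as the script does later
df <- data.frame(Anio = c(1960, 1960), Mes = c(1, 1), Dia = c(1, 2))
dates <- as.Date(paste(df$Anio, df$Mes, df$Dia, sep = "/"), format = "%Y/%m/%d")
dates  # 1960-01-01, 1960-01-02
```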
And the coordinate table should look like this, with spatial information saved in lat and long coordinates:
head(Coordinates)
## Station Longitude Latitude
## 1 1 ASUNCION -78.51550 -7.312100
## 2 2 Augusto Weberbauer -78.48480 -7.166611
## 3 3 AYLAMBO -78.52528 -7.186389
## 4 4 BAMBAMARCA -78.52528 -6.677500
## 5 5 CACHACHI -78.27111 -7.457500
## 6 6 CAJABAMBA -78.05056 -7.626944
Let’s first look at the temporal distribution of your data; that is, the period of record for each climate variable at each station. This will help us assess the amount of missing data we are dealing with, and designate an optimal period of record for your project.
This first chunk of script opens up a path in your working directory to save a PDF of the resulting period of record graph.
filename <- "Period of Record"
plotpath <- c(dir, "\\",filename, ".pdf") # USER-DEFINED creates a pdf path to produce a graphic of the span of records in the Table_All
plotpath <- paste(plotpath, collapse = '')
pdf(file = plotpath) # opens connection to pdf path
par(mfrow=c(1,1)) # parameter that dictates how many rows and columns of graphics to be present in the output pdf. (1,1) implies one graphic per page, in this case there is only 1.
Next, we want to make sure we have a column corresponding to the date at which each measurement is taken. In this case, we want to combine the “year”, “month”, and “day” columns. We will use the Precipitation data as our first example here.
Table_All <- Precip # designate which climate table you want to be working on here.
date1 <- paste(c(Table_All[1,2],"/",Table_All[1,3],"/",Table_All[1,4]),collapse="") # builds the date string (year/month/day) for the first row of data
date2 <- paste(c(Table_All[nrow(Table_All),2],"/",Table_All[nrow(Table_All),3],"/",Table_All[nrow(Table_All),4]),collapse="") # builds the date string for the last row of data
years <- seq(as.Date(date1),as.Date(date2),"days") # create date sequence for entire series
Table_All <- Table_All[,-(1:4)] # deletes the unnecessary index and year/month/day columns in one step
Table_All <- cbind(years,Table_All) # adds date column
This is how your table should look now (showing the first couple columns and rows). Note the date column ("years"):
## years ASUNCION Augusto Weberbauer AYLAMBO BAMBAMARCA
## 1 1960-01-01 NA NA NA NA
## 2 1960-01-02 NA NA NA NA
## 3 1960-01-03 NA NA NA NA
## 4 1960-01-04 NA NA NA NA
## 5 1960-01-05 NA NA NA NA
## 6 1960-01-06 NA NA NA NA
Next, let’s create a Record Table, which will contain information as to the presence/absence of a data point at each timestep for each station.
stations <- colnames(Table_All[,2:ncol(Table_All)])
Record_Table <- !is.na(Table_All)
Record_Table[Record_Table == TRUE] <- 1
Record_Table[Record_Table == 0] <- NA
Finally, let’s plot the period of record for each station.
Let’s see the resulting plot. Note that this plot doesn’t contain any information about the observations themselves; instead, it simply plots the existence of data at each timestep with a blue line. Gaps in the lines thus indicate missing data. The stations are labeled and stacked on the y-axis.
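The plotting chunk itself is not reproduced on this page; a minimal sketch of the idea, using toy stand-ins for the years, stations, and Record_Table objects built above, might look like this:

```r
# Toy stand-ins for the years, stations, and Record_Table objects built above
years <- seq(as.Date("1960-01-01"), as.Date("1960-03-31"), "days")  # 91 days
stations <- c("ASUNCION", "AYLAMBO")
obs <- cbind(c(rep(NA, 30), rnorm(61)),  # station 1: first month missing
             rnorm(91))                  # station 2: complete record
Record_Table <- !is.na(cbind(years, obs))
Record_Table[Record_Table == TRUE] <- 1
Record_Table[Record_Table == 0] <- NA

pdf("period_of_record_sketch.pdf")  # illustrative file name
plot(years, rep(NA_real_, length(years)), ylim = c(0, length(stations) + 1),
     yaxt = "n", xlab = "Date", ylab = "", main = "Period of Record")
axis(2, at = seq_along(stations), labels = stations, las = 1, cex.axis = 0.5)
for (i in seq_along(stations)) {
  # a horizontal blue segment at height i wherever station i has data;
  # gaps in the line correspond to missing data
  lines(years, Record_Table[, i + 1] * i, col = "blue", lwd = 2)
}
dev.off()
```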
We’ll repeat the process to see the period of record for minimum temperature at each station….
… and for maximum temperature.
Now, let’s take a look at how the stations are spatially distributed within your study site, and explore which stations have precipitation and/or temperature data available within any designated period of record. This will help us assess the availability of data in a spatial sense, again informing what an optimal period of record may be, and helping to inform us about any holes in data that may affect future workflows and analyses.
First, let’s define the longitude and latitude of your study site, the zoom level at which you want to see the base map, and the title of your map.
lon_project <- -79.25 # USER-DEFINED: longitude of project (for map)
lat_project <- -6.75 # USER-DEFINED: latitude of project (for map)
zoom_map <- 8 # USER-DEFINED: zoom for satellite map
map_title <- "Climate Stations Chancay Lambayeque" # title of map
Now, let’s use the ggmap and ggplot2 packages to plot the locations of your climate stations, color-coded by what kind of climate data each one has, on a base map of nice satellite imagery. We will use the shapefile of the study site that we loaded in at the beginning of this script.
stations_temp <- sub("\\.","", sub("\\.", " ", sub("\\.", " ", colnames(Temp_Min[5:ncol(Temp_Min)])))) # NOTE: this step is to eliminate any weird characters in the station names from the precip and temp data tables; may be different for your project!!!
stations_precip <- sub("\\.","", sub("\\.", " ", sub("\\.", " ", colnames(Precip[5:ncol(Precip)])))) # NOTE: this step is to eliminate any weird characters in the station names from the precip and temp data tables; may be different for your project!!!
coordinates_temp <- Coordinates[Coordinates[,2] %in% stations_temp,] ## Coordinates for stations from which we are using temp data
coordinates_precip <- Coordinates[Coordinates[,2] %in% stations_precip,] ## Coordinates for stations from which we are using precip data
install.packages("ggmap", repos = "http://cran.cnr.berkeley.edu/") # make sure you have ggmap installed for use of its mapping functions
library(ggmap)
maptest <- get_map(location=c(lon=lon_project, lat=lat_project), zoom = zoom_map, maptype="satellite",scale=2, source="google",filename="maptest", force = TRUE) # downloads and saves a google map of your project area
ggmap(maptest) + # plots the background map
geom_polygon(aes(x=long, y=lat),size=1,colour='black',data=project,alpha=0) + # plots the polygon of your project area
geom_text(data = Coordinates, aes(x = Coordinates[,"Longitude"], y = Coordinates[,"Latitude"], label = Coordinates[,"Station"]),
size = 1.5, vjust = 0, hjust = -0.2, color="light grey") + # plots the names of all the stations
geom_point(data = as.data.frame(Coordinates), aes(x=as.numeric(Coordinates[,"Longitude"]), y=as.numeric(Coordinates[,"Latitude"])),colour='black',size=3.5,shape=1) + # plots all the stations in black circles
geom_point(data = as.data.frame(coordinates_precip), aes(x=as.numeric(coordinates_precip[,"Longitude"]), y=as.numeric(coordinates_precip[,"Latitude"])),colour='blue',size=1.5,shape=6) + # plots the precipitation stations in blue
geom_point(data = as.data.frame(coordinates_temp), aes(x=as.numeric(coordinates_temp[,"Longitude"]), y=as.numeric(coordinates_temp[,"Latitude"])),colour='orange',size=1.5,shape=2) + # plots the temperature stations in orange
labs(title=map_title) # adds a title to the map
ggsave("Map_Stations.jpg") # saves the map as a JPG
Let’s see how the map looks. The empty black circles mark the locations of the climate stations in your dataset. If a circle contains a blue triangle, the station has precipitation data; an orange triangle indicates temperature data.
The third lens through which to view your climate data is statistical; that is, looking at the climate measurements themselves. We will plot the time-series graphs for precipitation, minimum temperature, and maximum temperature at each of your stations individually, as well as on a single plot. We will also calculate basic statistical indicators for each time series: mean, standard deviation, minimum, maximum, and percent of missing data. This portion of the script can be iterative: you can come back and re-run it after correcting your data, and/or after filling in missing data, to better visualize and understand the dataset.
Let’s first define the time frame for which you want to plot the station time series.
year1 <- 1970 ##USER-DEFINED first year
year2 <- 2010 ##USER-DEFINED last year
Now, we are going to load in your data (this example shows precipitation data; you can repeat the process for other climate data simply by changing the “data” and “clima.var” variables). This chunk plots the station time series and saves them to a PDF, and saves the summary statistical indicators to a table.
setwd(dir)
data <- read.csv("Table_All_Precip.csv",check.names=F) #loading in data
clima.var <- "Precip" #defining the title of your output documents
data <- subset(data, data$Anio >= year1)
data <- subset(data, data$Anio <= year2)
# Creates a sequence of dates from first year of record to last year of record (for plotting purposes)
date1 <- paste(c(data[1,2],"/",data[1,3],"/",data[1,4]),collapse="")
date2 <- paste(c(data[nrow(data),2],"/",data[nrow(data),3],"/",data[nrow(data),4]),collapse="")
years <- seq(as.Date(date1,format="%Y/%m/%d"),as.Date(date2,format="%Y/%m/%d"),"days")
# Opens up plot path for output PDF
plotpath <- c(dir, "\\", paste(c(clima.var,"_","Daily_Obs_All"),collapse=""), ".pdf") #USER-DEFINED: name of daily plot output file
plotpath <- paste(plotpath, collapse = '')
pdf(file = plotpath)
par(mfrow=c(3,1))
#Creates table with %NA, mean, max, min for each station
allstations <- colnames(data[,5:length(colnames(data))])
matrix <- matrix(data=NA, nrow=length(allstations),ncol=6)
colnames(matrix) <- c("Station","Percent_NA","Max","Min","Mean","SD")
matrix[,1] <- as.character(allstations)
# Calculates NA_percent, max, min, and mean of each data column, saves it to a table and plots each column of data on a separate plot (including % NA on plot)
for (i in 1:length(allstations))
{
p <- i+4
NA_percent <- round(sum(is.na(data[,p]))/nrow(data)*100, 2)
Mean <- mean(as.numeric(data[,p]),na.rm=TRUE)
Max <- max(as.numeric(data[,p]),na.rm=TRUE)
Min <- min(as.numeric(data[,p]),na.rm=TRUE)
SD <- sd(as.numeric(data[,p]),na.rm=TRUE)
matrix[i,2] <- NA_percent
matrix[i,3] <- Max
matrix[i,4] <- Min
matrix[i,5] <- round(Mean,3)
matrix[i,6] <- round(SD, 3)
if(NA_percent < 100) # only plots if there is at least 1 value in the time series during the time frame selected
{
plot(years,as.numeric(data[,p]),type ="p", main =allstations[i],ylab=clima.var,col="blue",sub=paste(c("% values NA = ",NA_percent),collapse=""),cex=0.2) # values must be numeric to plot
}
}
dev.off()
setwd(dir)
write.csv(matrix,paste(c("SummaryTimeSeries_",clima.var,".csv"),collapse=""), row.names=FALSE)
Let’s take a look at a sample plot. Note that the plot contains a line describing what percent of the particular time series is missing (aka NA).
The above chunk of script also calculates the percent of missing data, maximum, minimum, mean, and standard deviation for each station and saves them in a table (called “SummaryTimeSeries”). Let’s take a look at the first couple lines of that table here.
## Station Percent_NA Max Min Mean SD
## 1 ASUNCION 4.93 80.2 0 2.225 5.688
## 2 Augusto Weberbauer 0.61 40.5 0 1.791 3.843
## 3 AYLAMBO 80.87 50.1 0 1.754 3.917
## 4 BAMBAMARCA 9.08 51.5 0 2.044 4.379
## 5 CACHACHI 1.47 63.9 0 2.477 5.825
## 6 CAJABAMBA 8.61 66.2 0 2.678 5.712
We can repeat the process above for minimum temperature and maximum temperature, simply by changing the first two variables (data and clima.var). Let’s take a look at a sample plot and the data summary table for minimum temperature….
## Station Percent_NA Max Min Mean SD
## 1 ASUNCION 41.67 16.5 1.2 12.013 1.708
## 2 Augusto Weberbauer 0.41 14.6 -4.3 7.52 3.041
## 3 AYLAMBO 65.87 15 1 9.724 1.676
## 4 BAMBAMARCA 0 16 -5.8 9.684 2.957
## 5 CACHACHI 100 -Inf Inf NaN <NA>
## 6 CAJABAMBA 100 -Inf Inf NaN <NA>
… and maximum temperature.
## Station Percent_NA Max Min Mean SD
## 1 ASUNCION 41.67 31.2 13.4 22.276 2.859
## 2 Augusto Weberbauer 0.41 26.4 13.4 21.562 1.659
## 3 AYLAMBO 65.87 28.8 10.7 21.021 2.466
## 4 BAMBAMARCA 0 26.7 11 19.748 1.972
## 5 CACHACHI 100 -Inf Inf NaN <NA>
## 6 CAJABAMBA 100 -Inf Inf NaN <NA>
Armed with the knowledge and understanding you’ve gained about your data, let’s move on to subsetting your data to a designated period of record, and to the final stations you want to include in the next steps of your analysis.
year1 <- 1970 ##USER-DEFINED first year
year2 <- 2015 ##USER-DEFINED last year
Temp_Min <- subset(Temp_Min, Temp_Min$Anio >= year1) # Subsets data frame for which the year is greater than the previously defined start year
Temp_Min <- subset(Temp_Min, Temp_Min$Anio <= year2) # Selects data frame for which the year is less than the previously defined end year
Temp_Max <- subset(Temp_Max, Temp_Max$Anio >= year1) # Subsets data frame for which the year is greater than the previously defined start year
Temp_Max <- subset(Temp_Max, Temp_Max$Anio <= year2) # Selects data frame for which the year is less than the previously defined end year
Precip <- subset(Precip, Precip$Anio >= year1) # Subsets data frame for which the year is greater than the previously defined start year
Precip <- subset(Precip, Precip$Anio <= year2) # Selects data frame for which the year is less than the previously defined end year
Note that at this point in the process, you can also eliminate those stations you’ve identified as too faulty, with too much missing data, or too far from your study area.

Identifying and removing errors & extremes
After you’ve familiarized yourself with your data, the next step is to identify errors and anomalies in your time series. This requires an understanding of what kinds of climate values are appropriate for your study site.
First, let’s eliminate negative precipitation values, as well as any nonnumeric values, in all three climate datasets. Note that these nonnumeric values can be code for something else, depending on how the measurements were taken. In that case, it may be more appropriate to replace them with known numeric values (for example, a “T” in precipitation data may stand for “Traces” and could be replaced with a “0” value).
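As an illustration, mapping a “T” code to 0 before the numeric coercion might look like this (the “T”-for-Traces convention is just the example mentioned above; check the codes your own data source uses):

```r
# Toy example: map a "T" (Traces) code to 0 before coercing to numeric;
# any remaining nonnumeric entries become NA during the coercion
raw <- c("0.5", "T", "2.1", "x9")
raw[raw == "T"] <- "0"
vals <- suppressWarnings(as.numeric(raw))
vals  # 0.5 0.0 2.1 NA
```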
NewPrecip <- sapply(Precip[,5:ncol(Precip)],as.numeric) # Formats precip data as numeric (which automatically replaces any nonnumeric values with "NA")
neg_number_precip <- length(which(NewPrecip<0)) # Returns number of times a precip value is < 0
NewPrecip[NewPrecip < 0] <- NA # Replaces precip values < 0 with NA
neg_number_precip2 <- length(which(NewPrecip<0)) # Returns number of times a precip value is < 0 after fixing (for purposes of checking work). Should be 0.
Precip <- cbind(Precip[,1:4],NewPrecip) # Re-saves "fixed" precip data as "Precip" R object
Temp_MinNum <- sapply(Temp_Min[,5:ncol(Temp_Min)],as.numeric) # Changes the values in the min Temp data table to numeric format
Temp_Min <- cbind(Temp_Min[,1:4],Temp_MinNum)
Temp_MaxNum <- sapply(Temp_Max[,5:ncol(Temp_Max)],as.numeric) # Changes the values in the min Temp data table to numeric format
Temp_Max <- cbind(Temp_Max[,1:4],Temp_MaxNum)
Now, let’s define some thresholds to determine what an “extreme” (high or low) measurement would be. You can determine these thresholds from local knowledge or by referring to the plots and statistical summaries we’ve previously created. Note that the thresholds defined below are specific to our example study site, the Chancay Lambayeque watershed in Peru.
max_Tmin <- 35 ##USER-defined: threshold for reasonable max of Tmin
max_Tmax <- 41 ##USER-defined: threshold for reasonable Tmax max
min_Tmin <- -10.5 ##USER-defined: threshold for reasonable Tmin min
min_Tmax <- 9 ##USER-defined: threshold for reasonable Tmax min
max_Precip <- 146 ##USER-defined: threshold for reasonable precip max
Now, let’s use our minimum temperature dataset as an example of how to identify and eliminate the values above or below these thresholds. This script identifies the extreme values and saves them to a reference table exported to your directory, eliminates them from your original dataset, and saves a new extreme-free table of station data to your directory.
Numeric <- Temp_Min[,5:ncol(Temp_Min)] # defines variable
number_below <- length(which(Numeric<min_Tmin)) # Finds the number of times a minimum temperature value falls below the user-defined acceptable minimum for min temperature
number_above <- length(which(Numeric>max_Tmin)) # Finds the number of times a minimum temperature value falls above the user-defined acceptable maximum for min temperature
number_below # Prints the number of extreme low values (for reference, to see in code output)
## [1] 0
number_above # Prints the number of extreme high values (for reference, to see in code output)
## [1] 9
errormatrix <- matrix(data=NA, nrow=ncol(Numeric)*nrow(Numeric),ncol=3) # creates an empty error matrix to be filled in by the next loop
if(number_below > 0) # If there are extreme values, loops through station data to select the "extreme" values and saves them in a table
{
for (c in 1:ncol(Numeric)) # cycles through data table to look for extreme values
{
for (r in 1:nrow(Numeric))
{
if (!is.na(Numeric[r,c]) && Numeric[r,c] < min_Tmin) # if a value is identified as "extreme", it is saved into the error matrix
{ row <- r + (c-1)*nrow(Numeric)
names <- colnames(Numeric)
errormatrix[row,1] <- names[c] # the station in which the extreme value occurred is saved into the error matrix
errormatrix[row,2] <- paste(c(Temp_Min[r,2],"-",Temp_Min[r,3],"-",Temp_Min[r,4]),collapse="") # the date on which the extreme value occurred is saved into the error matrix
errormatrix[row,3] <- Numeric[r,c] # the extreme value itself is saved into the error matrix
}
}
}
}
errormatrix3 <- na.omit(errormatrix) # lines in the error matrix which remained empty are eliminated from the table
colnames(errormatrix3) <- c("Station","Date","Value_Extreme Low") # column names for the error matrix are defined
errormatrix <- matrix(data=NA, nrow=ncol(Numeric)*nrow(Numeric),ncol=3) # creates an empty error matrix to be filled in by the next loop
if(number_above > 0) ## Same loop/logic as above, but looking for extreme high values
{
errormatrix <- matrix(data=NA, nrow=ncol(Numeric)*nrow(Numeric),ncol=3)
for (c in 1:ncol(Numeric))
{
for (r in 1:nrow(Numeric))
{
if (!is.na(Numeric[r,c]) && Numeric[r,c] > max_Tmin)
{ row <- r + (c-1)*nrow(Numeric)
names <- colnames(Numeric)
errormatrix[row,1] <- names[c]
errormatrix[row,2] <- paste(c(Temp_Min[r,2],"-",Temp_Min[r,3],"-",Temp_Min[r,4]),collapse="")
errormatrix[row,3] <- Numeric[r,c]
}
}
}
}
errormatrix2 <- na.omit(errormatrix)
colnames(errormatrix2) <- c("Station","Date","Value_Extreme High")
errormatrix_res <- rbind(errormatrix2,errormatrix3) # combines the error matrices for the extreme high values and extreme low values
Numeric[Numeric > max_Tmin] <- NA # replaces the extreme high values in the original data table with NA
Numeric[Numeric < min_Tmin] <- NA # replaces the extreme low values in the original data table with NA
Temp_Min_C <- cbind(Temp_Min[,2:4],Numeric) # formats the final output data table to include the year, month, day columns
setwd(dir)
write.csv(Temp_Min_C, "Table_All_NoExt_Temp_Min.csv") # Saves the intermediate data table (subsetted and without extreme values) in the output directory.
setwd(dir)
write.csv(errormatrix_res, "Temp_Min_Extremes.csv") # Saves the error matrix containing the extreme high and extreme low values in the output directory.
Let’s take a look at a sample of the extreme values the script has identified for our minimum temperature dataset.
## Station Date Value_Extreme High
## 1 ASUNCION 2012-3-25 112.2
## 2 CAYALTI 2014-2-20 212
## 3 CELENDIN 2008-4-10 211
## 4 LA ENCAÑADA 2003-4-12 92
## 5 LA ENCAÑADA 2005-2-21 1103
## 6 MAGDALENA 2013-12-10 174
We can repeat the above process for our maximum temperature dataset. Let’s skip forward and see a sample of the extreme values the script identified in this dataset.
## Station Date Value_Extreme High
## 1 ASUNCION 2013-6-14 242
## 2 ASUNCION 2013-11-3 210.6
## 3 ASUNCION 2013-12-26 216
## 4 BAMBAMARCA 2011-12-26 177
## 5 CONTUMAZA 2013-6-25 206
## 6 EL ESPINAL 2011-12-14 286
The last dataset needing some cleaning up is precipitation; because we don’t have a minimum threshold for precipitation (we’ve already removed observations below 0), the process for identifying and removing extremes is slightly different. Let’s see.
Numeric <- Precip[,5:ncol(Precip)]
number_above <- length(which(Numeric>max_Precip))
number_above # the number of values above the threshold
## [1] 18
errormatrix <- matrix(data=NA, nrow=ncol(Numeric)*nrow(Numeric),ncol=3) # creates an empty error matrix to be filled in by the next loop
if(number_above > 0) ## Loops through station data to select the "extreme" values and saves them in a table
{
for (c in 1:ncol(Numeric))
{
for (r in 1:nrow(Numeric))
{
if (!is.na(Numeric[r,c]) && Numeric[r,c] > max_Precip)
{ row <- r + (c-1)*nrow(Numeric)
names <- colnames(Numeric)
errormatrix[row,1] <- names[c]
errormatrix[row,2] <- paste(c(Precip[r,2],"-",Precip[r,3],"-",Precip[r,4]),collapse="")
errormatrix[row,3] <- Numeric[r,c]
}
}
}
}
errormatrix2 <- na.omit(errormatrix)
colnames(errormatrix2) <- c("Station","Date","Value")
setwd(dir)
write.csv(errormatrix2, "Precip_Extremes.csv")
Numeric[Numeric > max_Precip] <- NA #Replaces extreme values with NA
Precip_C <- cbind(Precip[,2:4],Numeric)
setwd(dir)
write.csv(Precip_C, "Table_All_NoExt_Precip.csv",row.names = F)
Let’s see what extreme precipitation values the script has identified (and removed from the dataset).
## Station Date Value
## 1 CHOTA 1980-10-17 160.5
## 2 CONTUMAZA 1979-3-5 387
## 3 CONTUMAZA 1979-3-7 160
## 4 CONTUMAZA 1979-3-19 280
## 5 CONTUMAZA 1979-3-20 347
## 6 CONTUMAZA 1979-3-24 251
This process may be iterative as you adjust the minimum and maximum thresholds for your climate data. Note that other anomalies and errors may exist in your dataset, which would require more detailed coding and more local knowledge of climate conditions. These additional anomalies may include high precipitation values in the dry season, non-sequential temperature or precipitation observations, etc.
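As one illustration, a sketch for flagging suspiciously wet dry-season observations might look like this (the June-August dry season and the 20 mm threshold are made-up assumptions for the toy data, not values from this exercise):

```r
# Sketch: flag suspiciously wet dry-season observations. Toy rows stand
# in for the cleaned Precip_C table (year/month/day, then stations);
# the Jun-Aug dry season and 20 mm threshold are illustrative only.
Precip_C <- data.frame(Anio = c(1980, 1980, 1980),
                       Mes  = c(7, 7, 1),
                       Dia  = c(10, 11, 5),
                       CHOTA = c(0, 35.2, 12.0))
dry_months <- c(6, 7, 8)                       # adjust for your site
dry <- Precip_C[Precip_C$Mes %in% dry_months, ]
vals <- as.matrix(dry[, 4:ncol(dry), drop = FALSE])
suspect <- which(vals > 20, arr.ind = TRUE)    # row/column indices of flagged values
dry[suspect[, "row"], 1:3]                     # dates of the flagged observations
```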
Finally, we want to identify those data values that correspond to a very specific error in your dataset: dates in which the recorded maximum temperature is less than the recorded minimum temperature. This script will identify the dates in which this occurs and save the dates as well as the observations causing this error in a table exported to your directory.
Temp_Min <- Temp_Min_C #renames the variables with "corrected" dataset
Temp_Max <- Temp_Max_C #renames the variables with "corrected" dataset
temp_stations <- intersect(colnames(Temp_Min),colnames(Temp_Max)) # creates variable containing the names of the stations that have both min and max temperature measurements
Temp_Min <- subset(Temp_Min, select=temp_stations) # subsets the minimum temperature data table to select only those columns (ie stations) that have both min and max temperature measurements
Temp_Max <- subset(Temp_Max, select=temp_stations) # subsets the maximum temperature data table to select only those columns (ie stations) that have both min and max temperature measurements
errormatrix <- matrix(data=NA, nrow=ncol(Temp_Min)*nrow(Temp_Min),ncol=4) # sets up empty matrix to be filled in with the errors
colnames(errormatrix) <- c("Station","Date","Temp_Min","Temp_Max") # names the column of this error matrix
for (c in 1:ncol(Temp_Min)) # cycles through the columns of the data table (to look for errors)
{
for (r in 1:nrow(Temp_Min)) # cycles through the rows of each column in the data table (to look for errors)
{ if (!is.na(as.numeric(Temp_Min[r,c])) && !is.na(as.numeric(Temp_Max[r,c]))) # only looks for errors if both temp_min and temp_max have a measurement for that timestep
{
if (as.numeric(Temp_Min[r,c]) > as.numeric(Temp_Max[r,c])) # selects the row/column in which minimum temperature is greater than maximum temperature for a timestep
{ row <- r + (c-4)*nrow(Temp_Min)
errormatrix[row,1] <- colnames(Temp_Max[c]) # saves the station name in which this error occurs in the error matrix
errormatrix[row,2] <- paste(c(Temp_Max[r,1],"-",Temp_Max[r,2],"-",Temp_Max[r,3]),collapse="") # saves the date on which this error occurs (year/month/day are columns 1:3 of the corrected tables)
errormatrix[row,3] <- Temp_Min[r,c] # saves the minimum temperature value which caused this error in the error matrix
errormatrix[row,4] <- Temp_Max[r,c] # saves the maximum temperature value which caused this error in the error matrix
}
}
}
}
errormatrix2 <- na.omit(errormatrix) # removes empty lines from the matrix
write.csv(errormatrix2, "Temp_Errors.csv") # saves the error matrix as a csv table in the output directory
Let’s take a look at the errors the script has identified.
## Station Date Temp_Min Temp_Max
## 1 CAYALTI 11-21-23.6 27.8 27.6
## 2 FERREÑAFE 4-2-22 25 24.8
## 3 GRANJA PORCON 3-7-NA 15.2 14.2
## 4 GRANJA PORCON 3-8-NA 15.4 14.1
## 5 GRANJA PORCON 3-11-NA 15.3 12
## 6 GRANJA PORCON 3-13-NA 21.2 15.4
## 7 GRANJA PORCON 3-15-NA 21.2 16.7
## 8 GRANJA PORCON 3-19-NA 18.1 18
## 9 JAEN 12-2-NA 29 27.8
## 10 JAYANCA(LA VIÑA) 4-19-23.4 32.4 18
## 11 LLAPA 2-7-18.1 21.8 18.7
## 12 LLAPA 2-9-19.1 21.2 19
## 13 LLAPA 2-11-22.7 28 18.2
## 14 LLAPA 2-17-18.9 28 18.4
## 15 LLAPA 2-18-19.3 18.2 17.4
## 16 LLAPA 2-19-19 27 19.4
## 17 LLAPA 2-20-17.9 30.4 21.8
## 18 LLAPA 2-21-18.3 30 19.6
## 19 LLAPA 2-22-21.6 24 21
## 20 MAGDALENA 4-2-24.4 28.6 28
## 21 NAMBALLE 8-3-NA 27.2 26
## 22 NAMBALLE 8-10-NA 27.2 26
## 23 NAMBALLE 8-12-NA 28.5 27
Because these errors may have been caused by a variety of reasons (human error when recording observations, a glitch in recording equipment, etc), you will have to manually navigate to those values in your original dataset and fix them as you see most appropriate.
Once you have acquainted yourself with your data and eliminated any clear extreme/error values within your datasets, you will still need to fill in missing data. This workflow is described in detail at the following link.
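To give a flavor of that workflow, the simplest conceivable infill replaces a station’s missing values with its long-term mean; this is only a toy sketch, and the linked workflow covers more defensible methods:

```r
# Toy sketch of the simplest possible infill: replace a station's
# missing values with its long-term mean (for illustration only)
x <- c(1.2, NA, 0.8, NA, 2.0)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x  # the two NAs are replaced by the station mean (1.3333...)
```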
And with that, we conclude this script covering how to conduct some initial analyses of climatological time series. Note that other methods exist to identify and correct a climate time series; the workflow we’ve introduced here serves as an initial step. Potential next steps include further refining and correcting your time series before filling in missing data. Further applications include preparing climate input series for a WEAP model.